Conversation

@treff7es treff7es commented Nov 7, 2025

Summary

This PR adds automatic lineage inference from DataHub to the Kafka Connect source connector. Instead of relying solely on connector manifests, the ingestion can now query DataHub's metadata graph to resolve schemas and generate both table-level and column-level lineage.

Motivation

Currently, Kafka Connect lineage extraction is limited by what's explicitly declared in connector configurations. This PR enables:

  1. Wildcard pattern expansion: Patterns like table.include.list: "database.*" in connector configurations can now be expanded to actual table names by querying DataHub
  2. Column-level lineage: Generate fine-grained lineage showing which source columns map to Kafka topic fields
  3. Schema-aware ingestion: Leverage existing metadata in DataHub to enrich Kafka Connect lineage without requiring external database connections
  4. Auto-enabled for Confluent Cloud: Schema resolver automatically enabled for Confluent Cloud environments where enhanced lineage is most valuable

Changes

New Configuration Options

Added three new configuration fields to KafkaConnectSourceConfig:

source:
  type: kafka-connect
  config:
    # Enable DataHub schema resolution
    # Auto-enabled for Confluent Cloud, disabled for OSS by default
    use_schema_resolver: true
    
    # Expand wildcard patterns to concrete table names (default: true)
    schema_resolver_expand_patterns: true
    
    # Generate column-level lineage (default: true)
    schema_resolver_finegrained_lineage: true

Auto-Enable for Confluent Cloud

New behavior: use_schema_resolver is automatically enabled when Confluent Cloud is detected via:

  • confluent_cloud_environment_id + confluent_cloud_cluster_id configuration
  • URI pattern matching (api.confluent.cloud/connect/v1/)

Users can opt-out by explicitly setting use_schema_resolver: false.
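
The detection logic can be pictured as a small decision helper like the following. This is a minimal sketch of the behavior described above; the function name and argument handling are illustrative, not the actual validator in KafkaConnectSourceConfig:

from typing import Optional


def resolve_use_schema_resolver(
    user_value: Optional[bool],
    connect_uri: str,
    confluent_cloud_environment_id: Optional[str],
    confluent_cloud_cluster_id: Optional[str],
) -> bool:
    """Decide the effective use_schema_resolver value (illustrative sketch)."""
    if user_value is not None:
        # An explicit user setting always wins, so opting out stays possible.
        return user_value
    is_confluent_cloud = bool(
        confluent_cloud_environment_id and confluent_cloud_cluster_id
    ) or "api.confluent.cloud/connect/v1/" in connect_uri
    # Auto-enable for Confluent Cloud; OSS stays disabled by default.
    return is_confluent_cloud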

Core Components

  1. SchemaResolver Integration (connector_registry.py):

    • New create_schema_resolver() method to instantiate schema resolvers with platform-specific configurations
    • Automatically attaches resolvers to connector instances during creation
    • Passes pipeline context through the instantiation chain
  2. Fine-Grained Lineage Extraction (common.py):

    • New _extract_fine_grained_lineage() method in BaseConnector
    • Assumes 1:1 column mapping between source tables and Kafka topics (typical for CDC connectors)
    • Generates FineGrainedLineageClass instances for column-level lineage (see the first sketch after this list)
  3. Enhanced Source Connectors (source_connectors.py):

    • Snowflake Source Connector: Pattern expansion support (e.g., ANALYTICS.PUBLIC.* → actual tables)
    • Debezium Connectors: Support for PostgreSQL, MySQL, SQL Server, MongoDB CDC
    • JDBC Source Connector: Generic JDBC source with pattern matching
    • Mongo Source Connector: MongoDB source with collection pattern expansion
    • ConfigDriven Source Connector: Generic connector for new/unsupported connector types
  4. Pattern Matching (pattern_matchers.py):

    • New module for consistent pattern matching across all connectors
    • Supports database wildcards (database.*, schema.table*, etc.)
    • Handles platform-specific naming conventions (see the second sketch after this list)
  5. Configuration Constants (config_constants.py):

    • Centralized configuration key definitions
    • Reduces code duplication and typos
    • Easier maintenance and documentation
  6. Improved Topic Handling:

    • Refactored topic extraction to distinguish between topics from API vs Kafka cluster
    • Better handling of Confluent Cloud scenarios (no direct Kafka access)
    • Improved sink connector topic filtering
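
For item 2, the 1:1 column mapping can be illustrated with a short sketch built on DataHub's lineage classes. The function name and signature here are illustrative, not the exact _extract_fine_grained_lineage() added in this PR:

from typing import List

from datahub.emitter.mce_builder import make_schema_field_urn
from datahub.metadata.schema_classes import (
    FineGrainedLineageClass,
    FineGrainedLineageDownstreamTypeClass,
    FineGrainedLineageUpstreamTypeClass,
)


def extract_fine_grained_lineage(
    source_dataset_urn: str,
    target_topic_urn: str,
    source_columns: List[str],
) -> List[FineGrainedLineageClass]:
    """Map each source column to the identically named topic field (1:1 assumption)."""
    return [
        FineGrainedLineageClass(
            upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
            upstreams=[make_schema_field_urn(source_dataset_urn, column)],
            downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
            downstreams=[make_schema_field_urn(target_topic_urn, column)],
        )
        for column in source_columns
    ]

And for item 4, wildcard handling boils down to translating connector patterns into regexes and filtering known names. A minimal sketch of that idea follows; the escaping and casing rules are assumptions, not necessarily what pattern_matchers.py implements:

import re
from typing import List


def expand_wildcard_pattern(pattern: str, candidate_tables: List[str]) -> List[str]:
    """Expand a connector-style wildcard (e.g. "database.*") against known table names."""
    # Treat "." as a literal separator and "*" as "match anything".
    regex = re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$", re.IGNORECASE)
    return [table for table in candidate_tables if regex.match(table)]


# Example: "ANALYTICS.PUBLIC.*" keeps only the matching fully qualified names.
tables = ["ANALYTICS.PUBLIC.ORDERS", "ANALYTICS.PUBLIC.USERS", "RAW.PUBLIC.EVENTS"]
assert expand_wildcard_pattern("ANALYTICS.PUBLIC.*", tables) == [
    "ANALYTICS.PUBLIC.ORDERS",
    "ANALYTICS.PUBLIC.USERS",
]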

Code Quality Improvements

  • Removed redundant connector instantiation in topic derivation
  • Enhanced null safety with comprehensive checks before schema resolver access
  • Improved error handling with detailed logging and graceful fallbacks
  • Sanitized test data (removed real company names)
  • Better separation of concerns (API topics vs Kafka topics)

Usage

OSS Kafka Connect (Default Behavior)

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"
    # use_schema_resolver: false (default - no change in behavior)

Confluent Cloud (Auto-Enabled)

source:
  type: kafka-connect
  config:
    confluent_cloud_environment_id: "env-123"
    confluent_cloud_cluster_id: "lkc-456"
    # use_schema_resolver: true (auto-enabled!)
    # Pattern expansion and column-level lineage work out of the box

Confluent Cloud (Opt-Out)

source:
  type: kafka-connect
  config:
    confluent_cloud_environment_id: "env-123"
    confluent_cloud_cluster_id: "lkc-456"
    use_schema_resolver: false  # Explicitly disabled

OSS with Schema Resolver (Explicit Enable)

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"
    use_schema_resolver: true
    schema_resolver_expand_patterns: true
    schema_resolver_finegrained_lineage: true

Testing

  • Comprehensive test coverage across 15 unit test modules
  • ✅ All linting checks pass (ruff)
  • ✅ All type checks pass (mypy)
  • ✅ Integration tests pass for OSS and Confluent Cloud scenarios

Test modules:

  • test_kafka_connect.py - Core connector tests
  • test_kafka_connect_config_validation.py - Auto-enable logic tests (8 new tests)
  • test_kafka_connect_schema_resolver.py - Schema resolver integration
  • test_kafka_connect_snowflake_source.py - Snowflake connector tests
  • test_kafka_connect_pattern_matchers.py - Pattern matching tests
  • test_kafka_connect_config_constants.py - Configuration validation
  • test_kafka_connect_connector_registry.py - Connector registration tests
  • Plus 8 additional test modules for specific features

Documentation

  • Updated docs/sources/kafka-connect/kafka-connect.md with:
    • Auto-enable behavior explanation
    • Configuration examples for OSS and Confluent Cloud
    • Prerequisites and recommended ingestion order
    • Opt-out instructions

Breaking Changes

None - All features are opt-in (or auto-enabled only for Confluent Cloud). Existing Kafka Connect ingestions continue to work unchanged.

Default behavior:

  • OSS: use_schema_resolver: false (unchanged behavior)
  • Confluent Cloud: use_schema_resolver: true (auto-enabled, can be disabled)
  • No changes to existing connector behavior without explicit configuration

Prerequisites for Schema Resolver

IMPORTANT: For the schema resolver to work, source database tables must be ingested into DataHub before running Kafka Connect ingestion. Without prior database ingestion, the schema resolver will not find any table metadata.

Recommended ingestion order:

  1. Ingest source databases (Postgres, MySQL, Snowflake, etc.) into DataHub
  2. Run Kafka Connect ingestion (with schema resolver enabled/auto-enabled)
  3. Enjoy enhanced lineage with column-level mappings!
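
A quick way to verify the prerequisite is to ask DataHub whether the source platform's datasets are already there. The snippet below is an illustrative pre-flight check (server URL and platform are placeholders), not part of this PR:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Count datasets already ingested for the source platform before enabling the
# schema resolver for Kafka Connect ingestion.
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
postgres_urns = list(
    graph.get_urns_by_filter(entity_types=["dataset"], platform="postgres", env="PROD")
)
print(f"Found {len(postgres_urns)} postgres datasets already ingested")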

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Nov 7, 2025

codecov bot commented Nov 7, 2025

❌ 1 Tests Failed:

Tests completed: 5456 | Failed: 1 | Passed: 5455 | Skipped: 32
View the top 1 failed test(s) by shortest run time
tests.integration.cassandra.test_cassandra::test_cassandra_ingest
Stack Traces | 19.1s run time
docker_compose_runner = <function docker_compose_runner.<locals>.run at 0x7fe6ce9e2ca0>
pytestconfig = <_pytest.config.Config object at 0x7fe7edc11290>
tmp_path = PosixPath('.../pytest-of-runner/pytest-0/test_cassandra_ingest0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7fe6ce412ed0>

    @pytest.mark.integration
    def test_cassandra_ingest(docker_compose_runner, pytestconfig, tmp_path, monkeypatch):
        # Tricky: The cassandra container makes modifications directly to the cassandra.yaml
        # config file.
        # See https://github..../cassandra/issues/165
        # To avoid spurious diffs, we copy the config file to a temporary location
        # and depend on that instead. The docker-compose file has the corresponding
        # env variable usage to pick up the config file.
        cassandra_config_file = _resources_dir / "setup/cassandra.yaml"
        shutil.copy(cassandra_config_file, tmp_path / "cassandra.yaml")
        monkeypatch.setenv("CASSANDRA_CONFIG_DIR", str(tmp_path))
    
>       with docker_compose_runner(
            _resources_dir / "docker-compose.yml", "cassandra"
        ) as docker_services:

.../integration/cassandra/test_cassandra.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
.../hostedtoolcache/Python/3.11.14....../x64/lib/python3.11/contextlib.py:137: in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
.../datahub/testing/docker_utils.py:65: in run
    with pytest_docker.plugin.get_docker_services(
.../hostedtoolcache/Python/3.11.14....../x64/lib/python3.11/contextlib.py:137: in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
venv/lib/python3.11........./site-packages/pytest_docker/plugin.py:212: in get_docker_services
    docker_compose.execute(command)
venv/lib/python3.11........./site-packages/pytest_docker/plugin.py:140: in execute
    return execute(command, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

command = 'docker compose --parallel -1 -f ".../integration/cassandra/docker-compose.yml" -p "pytest4967-cassandra" up --build --wait'
success_codes = (0,), ignore_stderr = False

    def execute(command: str, success_codes: Iterable[int] = (0,), ignore_stderr: bool = False) -> Union[bytes, Any]:
        """Run a shell command."""
        try:
            stderr_pipe = subprocess.DEVNULL if ignore_stderr else subprocess.STDOUT
            output = subprocess.check_output(command, stderr=stderr_pipe, shell=True)
            status = 0
        except subprocess.CalledProcessError as error:
            output = error.output or b""
            status = error.returncode
            command = error.cmd
    
        if status not in success_codes:
>           raise Exception(
                'Command {} returned {}: """{}""".'.format(command, status, output.decode("utf-8"))
            )
E           Exception: Command docker compose --parallel -1 -f ".../integration/cassandra/docker-compose.yml" -p "pytest4967-cassandra" up --build --wait returned 1: """ test-cassandra-load-keyspace Pulling 
E            test-cassandra Pulling 
E            ... (docker image layer pull/extract progress output omitted) ...
E            test-cassandra Pulled 
E            test-cassandra-load-keyspace Error Head "https://registry-1.docker..../cassandra/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Alibrary%2Fcassandra%3Apull&service=registry.docker.io": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E           Error response from daemon: Head "https://registry-1.docker..../cassandra/manifests/latest": Get "https://auth.docker.io/token?scope=repository%3Alibrary%2Fcassandra%3Apull&service=registry.docker.io": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E           """.

venv/lib/python3.11........./site-packages/pytest_docker/plugin.py:37: Exception



impressive docs!

# Schema resolver configuration for enhanced lineage
use_schema_resolver: bool = Field(
    default=False,
    description="Use DataHub's schema metadata to enhance CDC connector lineage. "

I would avoid introducing CDC as a new term to refer to Kafka Connect

CDC connector --> Kafka Connector
CDC sources/sinks -> Kafka Connect sources/sinks

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Nov 26, 2025
It's used when connectors don't have table.include.list configured, meaning they
capture ALL tables from the database.
The method first tries to use cached URNs from SchemaResolver (populated from

populated from previous ingestion runs? 🤔

# Use graph.get_urns_by_filter() to get all datasets for this platform
# This is more efficient than a search query and uses the proper filtering API
all_urns = set(
    self.schema_resolver.graph.get_urns_by_filter(

We use the schema resolver here

            all_urns = self.schema_resolver.get_urns()

  • Is this SchemaResolver instance instantiated with the platform and env of the source database? Either way, the resolver is either populated with a search such as the one below, or it will be empty.

and here

                self.schema_resolver.graph.get_urns_by_filter(
                    entity_types=["dataset"],
                    platform=platform,
                    platform_instance=self.schema_resolver.platform_instance,
                    env=self.schema_resolver.env,
                )

  • Here we just use the `graph` object, not actually the SchemaResolver itself.

I have the feeling that the usage of the SchemaResolver is largely incidental and only amounts to using its `DataHubGraph` object for the search.


Instead, I think it would be better to move some of this pattern+search logic down into the SchemaResolver itself. Or alternatively, just use the `DataHubGraph` object directly instead of creating a dependency on the SchemaResolver.

Comment on lines +803 to +805
# Filter by platform
if f"dataPlatform:{platform}" not in urn:
    continue

this shouldn't be necessary with the get_urns_by_filter search, no?

Comment on lines +807 to +812
# Filter by database - check if table_name starts with database prefix
if database_name:
    if table_name.lower().startswith(f"{database_name.lower()}."):
        # Remove database prefix to get "schema.table"
        schema_table = table_name[len(database_name) + 1 :]
        discovered_tables.append(schema_table)

All this parsing of the URN... I find it fragile.
Couldn't we make a more specific search, e.g. search for datasets in a given database (which should be a container)?


# Build target URN using DatasetUrn helper with correct target platform
target_urn = DatasetUrn.create_from_ids(
    platform_id=target_platform,

platform instance?


try:
    # Get all URNs from schema resolver and filter for the source platform
    # The cache may contain URNs from other platforms if shared across runs

what do you mean by "shared across runs"?

Comment on lines +1193 to +1220
if self.schema_resolver and self.schema_resolver.graph:
    logger.info(
        f"Kafka API unavailable for connector '{self.connector_manifest.name}' - "
        f"querying DataHub for Kafka topics to expand pattern '{topics_regex}'"
    )
    try:
        # Query DataHub for all Kafka topics
        kafka_topic_urns = list(
            self.schema_resolver.graph.get_urns_by_filter(
                platform="kafka",
                env=self.schema_resolver.env,
                entity_types=["dataset"],
            )
        )

        datahub_topics = []
        for urn in kafka_topic_urns:
            topic_name = self._extract_table_name_from_urn(urn)
            if topic_name:
                datahub_topics.append(topic_name)

        matched_topics = matcher.filter_matches([topics_regex], datahub_topics)

        logger.info(
            f"Found {len(matched_topics)} Kafka topics in DataHub matching pattern '{topics_regex}' "
            f"(out of {len(datahub_topics)} total Kafka topics)"
        )
        return matched_topics

We claim here to be using the SchemaResolver, but we are just using the graph object, and we are not even updating the SchemaResolver cache afterwards.

As per my understanding, we should refresh the SchemaResolver cache with the search.
And ideally, we could move some logic down to the SchemaResolver class.

In SchemaResolver we have

    def resolve_table(self, table: _TableName) -> Tuple[str, Optional[SchemaInfo]]: ...

maybe we could add some new method to resolve:

  • urns and schemas for a given database
  • or, urns and schemas for a given regexp pattern

I think pushing this kind of functionality down into SchemaResolver would simplify the code in the source and help with a better separation of responsibilities.
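
To make that concrete, a hypothetical helper in that direction could look like the sketch below. The name and signature are made up; only get_urns(), graph.get_urns_by_filter(), platform_instance, and env are taken from the code quoted in this thread:

import re
from typing import List

from datahub.sql_parsing.schema_resolver import SchemaResolver


def get_matching_urns(resolver: SchemaResolver, platform: str, pattern: str) -> List[str]:
    """Return cached dataset URNs matching a regex, refreshing from the graph if empty."""
    urns = set(resolver.get_urns())
    if not urns and resolver.graph is not None:
        # Refresh from DataHub, scoped to the same platform instance/env the
        # resolver was instantiated with, mirroring the search quoted above.
        urns = set(
            resolver.graph.get_urns_by_filter(
                entity_types=["dataset"],
                platform=platform,
                platform_instance=resolver.platform_instance,
                env=resolver.env,
            )
        )
    regex = re.compile(pattern, re.IGNORECASE)
    return sorted(urn for urn in urns if regex.search(urn))

A schema-returning variant next to the existing resolve_table() would then be the natural follow-up.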

Comment on lines +78 to +98
from datahub.sql_parsing.schema_resolver import SchemaResolver

# Get platform from connector instance (single source of truth)
platform = connector.get_platform()

# Get platform instance if configured
platform_instance = get_platform_instance(
    config, connector.connector_manifest.name, platform
)

logger.info(
    f"Creating SchemaResolver for connector {connector.connector_manifest.name} "
    f"with platform={platform}, platform_instance={platform_instance}"
)

return SchemaResolver(
    platform=platform,
    platform_instance=platform_instance,
    env=config.env,
    graph=ctx.graph,
)
@sgomezvillamor sgomezvillamor Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the bigquery source, we do

                return self.ctx.graph.initialize_schema_resolver_from_datahub(
                    platform=self.platform,
                    platform_instance=self.config.platform_instance,
                    env=self.config.env,
                    batch_size=self.config.schema_resolution_batch_size,
                )

which instantiates a SchemaResolver and populates its cache.

That would make sense here too, right? Have you considered it?


good abstractions here! ❤️

)


class TestParseCommaSeparatedList:

Nice tests covering all cases.
Wondering if parse_comma_separated_list could be moved to some utils module.

assert parse_comma_separated_list(input_str) == items


class TestConnectorConfigKeys:

Wondering about the value of testing constants 😅

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I haven't reviewed all files in detail, overall this looks pretty good.
I haven't been able to check coverage in codecov, but it may well be high given the amount of tests.

My concern is about the usage of SchemaResolver. We claim to use it, even in the user docs, and this is also reflected in the configs; however the usage seems to be mostly about using the graph object inside the SchemaResolver.

  • we never refresh (or I missed it!) the internal cache of the SchemaResolver (the caching would be one of the reasons to use SchemaResolver)
  • and we still do a lot of "resolution" logic outside of the SchemaResolver

